Part 1: Text Processing

*From Starter code

HappyDB is a corpus of 100,000 crowd-sourced happy moments via Amazon’s Mechanical Turk. You can read more about it on https://arxiv.org/abs/1801.07746

In this R notebook, we process the raw textual data for our data analysis.

Step 0 - Load all the required libraries

From the packages’ descriptions:

##                _                           
## platform       x86_64-w64-mingw32          
## arch           x86_64                      
## os             mingw32                     
## system         x86_64, mingw32             
## status                                     
## major          3                           
## minor          4.3                         
## year           2017                        
## month          11                          
## day            30                          
## svn rev        73796                       
## language       R                           
## version.string R version 3.4.3 (2017-11-30)
## nickname       Kite-Eating Tree

Step 1 - Load the data to be cleaned and processed

urlfile<-'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/cleaned_hm.csv'
hm_data <- read_csv(urlfile)

Step 2 - Preliminary cleaning of text

We clean the text by converting all the letters to the lower case, and removing punctuation, numbers, empty words and extra white space.

Step 3 - Stemming words and converting tm object to tidy object

Stemming reduces a word to its word stem. We stem the words here and then convert the “tm” object to a “tidy” object for much faster processing.

Step 4 - Creating tidy format of the dictionary to be used for completing stems

We also need a dictionary to look up the words corresponding to the stems.

Step 5 - Removing stopwords that don’t hold any significant information for our data set

We remove stopwords provided by the “tidytext” package and also add custom stopwords in context of our data.

Step 6 - Combining stems and dictionary into the same tibble

Here we combine the stems and the dictionary into the same “tidy” object.

Step 7 - Stem completion

Lastly, we complete the stems by picking the corresponding word with the highest frequency.

Step 8 - Pasting stem completed individual words into their respective happy moments

We want our processed words to resemble the structure of the original happy moments. So we paste the words together to form happy moments.

Step 9 - Keeping a track of the happy moments with their own ID

Exporting the processed text data into a CSV file

write_csv(hm_data, "../output/processed_moments.csv")

Step 1 - Load the processed text data along with demographic information on contributors

We use the processed data for our analysis and combine it with the demographic information available.

hm_data <- read_csv("../output/processed_moments.csv")

urlfile<-'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/demographic.csv'
demo_data <- read_csv(urlfile)

Combine both the data sets and keep the required columns for analysis

We select a subset of the data that satisfies specific row conditions.

hm_data <- hm_data %>%
  inner_join(demo_data, by = "wid") %>%
  select(wid,
         original_hm,
         gender, 
         marital, 
         parenthood,
         reflection_period,
         age, 
         country, 
         ground_truth_category, 
         predicted_category,
         text) %>%
  mutate(count = sapply(hm_data$text, wordcount)) %>%
  filter(gender %in% c("m", "f")) %>%
  filter(marital %in% c("single", "married")) %>%
  filter(parenthood %in% c("n", "y")) %>%
  filter(reflection_period %in% c("24h", "3m")) %>%
  mutate(reflection_period = fct_recode(reflection_period, 
                                        months_3 = "3m", hours_24 = "24h"))

Check the data for correct data types, outliers, and possible errors

summary(hm_data)
##       wid        original_hm           gender            marital         
##  Min.   :    1   Length:94574       Length:94574       Length:94574      
##  1st Qu.:  402   Class :character   Class :character   Class :character  
##  Median : 1097   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 2680                                                           
##  3rd Qu.: 3316                                                           
##  Max.   :13839                                                           
##   parenthood        reflection_period     age           
##  Length:94574       hours_24:46932    Length:94574      
##  Class :character   months_3:47642    Class :character  
##  Mode  :character                     Mode  :character  
##                                                         
##                                                         
##                                                         
##    country          ground_truth_category predicted_category
##  Length:94574       Length:94574          Length:94574      
##  Class :character   Class :character      Class :character  
##  Mode  :character   Mode  :character      Mode  :character  
##                                                             
##                                                             
##                                                             
##      text               count        
##  Length:94574       Min.   :  1.000  
##  Class :character   1st Qu.:  3.000  
##  Mode  :character   Median :  5.000  
##                     Mean   :  6.163  
##                     3rd Qu.:  7.000  
##                     Max.   :509.000
hm_data2 <- hm_data
hm_data2[,c(3:5,8:10)] <- as.data.frame(sapply(hm_data[,c(3:5,8:10)], as.factor))
summary(hm_data2)
##       wid        original_hm        gender       marital      parenthood
##  Min.   :    1   Length:94574       f:38734   married:40974   n:58859   
##  1st Qu.:  402   Class :character   m:55840   single :53600   y:35715   
##  Median : 1097   Mode  :character                                       
##  Mean   : 2680                                                          
##  3rd Qu.: 3316                                                          
##  Max.   :13839                                                          
##                                                                         
##  reflection_period     age               country     
##  hours_24:46932    Length:94574       USA    :73475  
##  months_3:47642    Class :character   IND    :16629  
##                    Mode  :character   VEN    :  546  
##                                       CAN    :  531  
##                                       GBR    :  355  
##                                       (Other): 2891  
##                                       NA's   :  147  
##       ground_truth_category        predicted_category     text          
##  affection       : 4555     achievement     :31944    Length:94574      
##  achievement     : 4044     affection       :32086    Class :character  
##  bonding         : 1663     bonding         :10131    Mode  :character  
##  enjoy_the_moment: 1426     enjoy_the_moment:10479                      
##  leisure         : 1255     exercise        : 1147                      
##  (Other)         :  436     leisure         : 7079                      
##  NA's            :81195     nature          : 1708                      
##      count        
##  Min.   :  1.000  
##  1st Qu.:  3.000  
##  Median :  5.000  
##  Mean   :  6.163  
##  3rd Qu.:  7.000  
##  Max.   :509.000  
## 
#Since we're focusing on an age group, let's clean up age
hm_data2$age <- as.numeric(hm_data2$age)
## Warning: NAs introduced by coercion
summary(hm_data2)
##       wid        original_hm        gender       marital      parenthood
##  Min.   :    1   Length:94574       f:38734   married:40974   n:58859   
##  1st Qu.:  402   Class :character   m:55840   single :53600   y:35715   
##  Median : 1097   Mode  :character                                       
##  Mean   : 2680                                                          
##  3rd Qu.: 3316                                                          
##  Max.   :13839                                                          
##                                                                         
##  reflection_period      age            country     
##  hours_24:46932    Min.   :  2.00   USA    :73475  
##  months_3:47642    1st Qu.: 25.00   IND    :16629  
##                    Median : 30.00   VEN    :  546  
##                    Mean   : 31.96   CAN    :  531  
##                    3rd Qu.: 35.00   GBR    :  355  
##                    Max.   :233.00   (Other): 2891  
##                    NA's   :78       NA's   :  147  
##       ground_truth_category        predicted_category     text          
##  affection       : 4555     achievement     :31944    Length:94574      
##  achievement     : 4044     affection       :32086    Class :character  
##  bonding         : 1663     bonding         :10131    Mode  :character  
##  enjoy_the_moment: 1426     enjoy_the_moment:10479                      
##  leisure         : 1255     exercise        : 1147                      
##  (Other)         :  436     leisure         : 7079                      
##  NA's            :81195     nature          : 1708                      
##      count        
##  Min.   :  1.000  
##  1st Qu.:  3.000  
##  Median :  5.000  
##  Mean   :  6.163  
##  3rd Qu.:  7.000  
##  Max.   :509.000  
## 
#Now we have the proper data types for our variables
#Other observations:
#1. Respondents are more male than female
#2. More single than married
#3. More non-parents than parents
#4 near equal split between reflection periods
#5 large majority from the US, followed by India
#6 Most happy moments fall under the categories of achievement, affection, bonding, and enjoy_the_moment


hist(hm_data2$age)

#There appear to be strange outliers (over 100 years old and as young as 2 years old) - which seem unlikely to be legitimate.
hm_data2 <- hm_data2[which(hm_data2$age<100 & hm_data2$age>5),]
hist(hm_data2$age)

#Majority of respondents are within the 20-40 age range.

hist(hm_data2$count)

For this analysis, I want to focus on a personal curiosity - for peers within my age group (26-30), what brings them happiness? How do males and females differ in this regard?

hm_data2$peer_agegroup <- as.factor(ifelse((hm_data2$age>25&hm_data2$age<31),"peer", "non-peer"))
summary(hm_data2)
##       wid        original_hm        gender       marital      parenthood
##  Min.   :    1   Length:94340       f:38650   married:40890   n:58715   
##  1st Qu.:  403   Class :character   m:55690   single :53450   y:35625   
##  Median : 1099   Mode  :character                                       
##  Mean   : 2682                                                          
##  3rd Qu.: 3319                                                          
##  Max.   :13839                                                          
##                                                                         
##  reflection_period      age           country     
##  hours_24:46821    Min.   :17.00   USA    :73328  
##  months_3:47519    1st Qu.:25.00   IND    :16551  
##                    Median :30.00   VEN    :  546  
##                    Mean   :31.86   CAN    :  531  
##                    3rd Qu.:35.00   GBR    :  352  
##                    Max.   :95.00   (Other): 2885  
##                                    NA's   :  147  
##       ground_truth_category        predicted_category     text          
##  affection       : 4536     achievement     :31865    Length:94340      
##  achievement     : 4038     affection       :31969    Class :character  
##  bonding         : 1662     bonding         :10116    Mode  :character  
##  enjoy_the_moment: 1425     enjoy_the_moment:10467                      
##  leisure         : 1254     exercise        : 1146                      
##  (Other)         :  434     leisure         : 7071                      
##  NA's            :80991     nature          : 1706                      
##      count          peer_agegroup  
##  Min.   :  1.000   non-peer:65663  
##  1st Qu.:  3.000   peer    :28677  
##  Median :  5.000                   
##  Mean   :  6.168                   
##  3rd Qu.:  7.000                   
##  Max.   :509.000                   
## 
summary(hm_data2%>%
          filter(hm_data2$peer_agegroup=="peer"))
##       wid        original_hm        gender       marital      parenthood
##  Min.   :    2   Length:28677       f:10793   married:11092   n:19718   
##  1st Qu.:  354   Class :character   m:17884   single :17585   y: 8959   
##  Median :  925   Mode  :character                                       
##  Mean   : 2438                                                          
##  3rd Qu.: 2788                                                          
##  Max.   :13828                                                          
##                                                                         
##  reflection_period      age           country     
##  hours_24:14074    Min.   :26.00   USA    :20994  
##  months_3:14603    1st Qu.:27.00   IND    : 6378  
##                    Median :28.00   CAN    :  156  
##                    Mean   :27.98   PHL    :  129  
##                    3rd Qu.:29.00   VEN    :  102  
##                    Max.   :30.00   (Other):  894  
##                                    NA's   :   24  
##       ground_truth_category        predicted_category     text          
##  affection       : 1406     achievement     :9627     Length:28677      
##  achievement     : 1208     affection       :9232     Class :character  
##  leisure         :  552     bonding         :3200     Mode  :character  
##  bonding         :  530     enjoy_the_moment:3169                       
##  enjoy_the_moment:  476     exercise        : 411                       
##  (Other)         :  127     leisure         :2610                       
##  NA's            :24378     nature          : 428                       
##      count          peer_agegroup  
##  Min.   :  1.000   non-peer:    0  
##  1st Qu.:  3.000   peer    :28677  
##  Median :  5.000                   
##  Mean   :  6.357                   
##  3rd Qu.:  7.000                   
##  Max.   :509.000                   
## 
#as a whole, it does not appear that those aged 26-30 have very different happiness categories vs the whole group - roughly 2/3 still fall under achievement and affection

Create a bag of words using the text data

bag_of_words <-  hm_data2 %>%
  unnest_tokens(word, text)

#Let's see the word frequencies in descending order
word_count <- bag_of_words %>%
  count(word, sort = TRUE)

##Plot
barplot(word_count[1:20,]$n, las = 2, names.arg = word_count[1:20,]$word,
        col ="lightblue", main ="Most frequent words",
        ylab = "Word frequencies")

#for peer group
bag_of_words_peer <-  hm_data2 %>%
  dplyr::filter(hm_data2$peer_agegroup=="peer")%>%
  unnest_tokens(word, text)

#Let's see the word frequencies in descending order
word_count_peer <- bag_of_words_peer %>%
  dplyr::group_by(bag_of_words_peer$gender)%>%
  count(word, sort = TRUE)

####how to show this by group
barplot(word_count_peer[1:20,]$n, las = 2, names.arg = word_count_peer[1:20,]$word,
        col ="lightblue", main ="Most frequent words",
        ylab = "Word frequencies")

total_words <- sum(word_count$n)
total_words_peer <- sum(word_count_peer$n)

freq_by_rank <- word_count %>% 
  mutate(rank = row_number(), 
         `term frequency` = n/total_words)

freq_by_rank
freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`)) + 
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10()

plot(hm_data2$gender, hm_data2$predicted_category)

#Here I attempt to see any significance from a TF-IDF analysis, but looking at the output it does not seem that meaningful. Many of the words appear to be misspelled - "selfit", "promotionit", "ruut", etc. and some don't even look like english - "thekkady"
book_words <- word_count_peer %>%
  bind_tf_idf(word, bag_of_words_peer$gender, n)

book_words %>%
  select(-total_words) %>%
  arrange(desc(tf_idf))
book_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(`bag_of_words_peer$gender`) %>% 
  top_n(15) %>% 
  ungroup %>%
  ggplot(aes(word, tf_idf, fill = `bag_of_words_peer$gender`)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~`bag_of_words_peer$gender`, ncol = 2, scales = "free") +
  coord_flip()
## Warning in mutate_impl(.data, dots): Unequal factor levels: coercing to
## character
## Warning in mutate_impl(.data, dots): binding character and factor vector,
## coercing into character vector

## Warning in mutate_impl(.data, dots): binding character and factor vector,
## coercing into character vector
## Selecting by tf_idf

#Only a handful of words appear meaningful:
#f: mommy, ebook
#m: split, prominent, nba, gf, flourishing, experiential, bitcoin

Create bigrams using the text data

hm_bigrams <- hm_data2 %>%
  filter(count != 1) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigram_counts <- hm_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)

bigram_counts
hm_bigrams_peer <- hm_data2 %>%
  dplyr::filter(hm_data2$peer_agegroup=="peer")%>%
  filter(count != 1) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigram_counts_peer <- hm_bigrams_peer %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)

bigram_counts_peer_f <- hm_bigrams_peer%>%
  dplyr::filter(hm_bigrams_peer$gender=="f")%>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)

bigram_counts_peer_m <- hm_bigrams_peer%>%
  dplyr::filter(hm_bigrams_peer$gender=="m")%>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)

#the #1 bi-gram for men aged 26-30 is... video games! Additionally, #9 is "played video" [likely games]

bigram_counts_peer_m
bigram_counts_peer_f
bigram_counts_mothers <- hm_bigrams %>%
  dplyr::group_by(hm_bigrams$gender)%>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)
set.seed(123)
wordcloud(words = word_count$word, freq = word_count$n, min.freq = 100, max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"))

wordcloud(words = word_count_peer$word, freq = word_count_peer$n, min.freq = 100, max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"))

class(bag_of_words$age)
## [1] "numeric"
class(hm_data)
## [1] "tbl_df"     "tbl"        "data.frame"
#pairwise correlation
#associated words
dtm <- cast_dtm(data=word_count_peer, term = word, document = bag_of_words_peer$gender, value = n)

hm_lda <- LDA(dtm, k=2, control=list(seed=1234))
summary(word_count_peer)
##  bag_of_words_peer$gender     word                 n          
##  f:6371                   Length:14879       Min.   :   1.00  
##  m:8508                   Class :character   1st Qu.:   1.00  
##                           Mode  :character   Median :   2.00  
##                                              Mean   :  12.25  
##                                              3rd Qu.:   6.00  
##                                              Max.   :2256.00
hm_lda
## A LDA_VEM topic model with 2 topics.
hm_topics <- tidy(hm_lda, matrix="beta")
hm_topics
hm_top_terms <- hm_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

hm_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

beta_spread <- hm_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1))

beta_spread <- beta_spread[order(beta_spread$log_ratio),]
dim(beta_spread)
## [1] 267   4
beta_spread_a <- beta_spread[1:10,]
beta_spread_b <- beta_spread[258:267,]
beta_spread_fin <- rbind(beta_spread_a, beta_spread_b)
beta_spread_fin
ggplot(data = beta_spread_fin, aes(y=beta_spread_fin$log_ratio, x=beta_spread_fin$term)) + geom_bar(stat='identity')+  coord_flip()

ggplot(data = beta_spread_fin, aes(y=beta_spread_fin$log_ratio, x=reorder(beta_spread_fin$term,-beta_spread_fin$log_ratio))) + geom_bar(stat='identity', position='dodge') +coord_flip()

Things I want to do: 1. show word frequencies by gender, parenthood, etc 2. show bi-grams by gender, etc. 3. plot overall word freq 4. plot highest freq bi-grams